Kandinsky 3.0 Technical Report
Arkhipkin, Vladimir, Filatov, Andrei, Vasilev, Viacheslav, Maltseva, Anastasia, Azizov, Said, Pavlov, Igor, Agafonova, Julia, Kuznetsov, Andrey, Dimitrov, Denis
We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of text-to-image Kandinsky models and reflecting our progress towards higher quality and realism of image generation. Compared to previous versions of Kandinsky 2.x, Kandinsky 3.0 leverages a two times larger U-Net backbone and a ten times larger text encoder, and removes the diffusion mapping. We describe the architecture of the model, the data collection procedure, the training technique, and the production system for user interaction. We focus on the key components that, based on a large number of experiments, had the most significant impact on improving the quality of our model compared to others. In our side-by-side comparisons, Kandinsky 3.0 shows better text understanding and performs better on specific domains. Project page: https://ai-forever.github.io/Kandinsky-3
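The report describes the model only at the architecture level; as a minimal sketch of actually running it, the snippet below assumes the publicly released checkpoint can be loaded through Hugging Face diffusers. The repo id "kandinsky-community/kandinsky-3" and the AutoPipelineForText2Image entry point are assumptions about the public release, not details from the report.

```python
# Minimal sketch: text-to-image generation with Kandinsky 3.0 via diffusers.
# Repo id and pipeline class are assumptions about the public release.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-3",
    torch_dtype=torch.float16,
).to("cuda")

# Unlike Kandinsky 2.x there is no separate diffusion prior/mapping stage:
# the text encoder conditions the latent-diffusion U-Net directly.
image = pipe(
    "a red cat sitting on a windowsill, watercolor",
    num_inference_steps=25,
    guidance_scale=3.0,
).images[0]
image.save("kandinsky3_sample.png")
```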
Diversity is Definitely Needed: Improving Model-Agnostic Zero-shot Classification via Stable Diffusion
Shipard, Jordan, Wiliem, Arnold, Thanh, Kien Nguyen, Xiang, Wei, Fookes, Clinton
In this work, we investigate the problem of Model-Agnostic Zero-Shot Classification (MA-ZSC), which refers to training non-specific classification architectures (downstream models) to classify real images without using any real images during training. Recent research has demonstrated that generating synthetic training images using diffusion models provides a potential solution to address MA-ZSC. However, the performance of this approach currently falls short of that achieved by large-scale vision-language models. One possible explanation is a potential significant domain gap between synthetic and real images. Our work offers a fresh perspective on the problem by providing initial insights that MA-ZSC performance can be improved by increasing the diversity of images in the generated dataset. We propose a set of modifications to the text-to-image generation process using a pre-trained diffusion model to enhance diversity, which we refer to as our $\textbf{bag of tricks}$. Our approach shows notable improvements in various classification architectures, with results comparable to state-of-the-art models such as CLIP. To validate our approach, we conduct experiments on CIFAR10, CIFAR100, and EuroSAT, which is particularly difficult for zero-shot classification due to its satellite image domain. We evaluate our approach with five classification architectures, including ResNet and ViT. Our findings provide initial insights into the problem of MA-ZSC using diffusion models. All code will be available on GitHub.
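The abstract does not spell out the individual tricks, so the sketch below only illustrates the general idea under stated assumptions: building a synthetic training set with a pre-trained Stable Diffusion checkpoint while randomizing prompt templates and guidance scale to increase diversity. The repo id, templates, and class list are illustrative, not the authors' settings.

```python
# Illustrative only: one way to inject diversity (random prompt templates and
# guidance scales) when generating a synthetic dataset for MA-ZSC-style training.
import os
import random
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

classes = ["airplane", "automobile", "bird", "cat", "deer"]  # e.g. a CIFAR10 subset
templates = [
    "a photo of a {}",
    "a blurry photo of a {}",
    "a painting of a {} in the wild",
    "a low-resolution picture of a {} outdoors",
]

os.makedirs("synthetic", exist_ok=True)
for cls in classes:
    for i in range(4):                                  # images per class (toy number)
        prompt = random.choice(templates).format(cls)
        scale = random.uniform(2.0, 9.0)                # randomized guidance strength
        img = pipe(prompt, guidance_scale=scale, num_inference_steps=30).images[0]
        img.save(f"synthetic/{cls}_{i}.png")
```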
Investigating Prompt Engineering in Diffusion Models
Witteveen, Sam, Andrews, Martin
With the spread of the use of Text2Img diffusion models such as DALL-E 2, Imagen, Midjourney and Stable Diffusion, one challenge that artists face is selecting the right prompts to achieve the desired artistic output. We present techniques for measuring the effect that specific words and phrases in prompts have, and (in the Appendix) present guidance on the selection of prompts to produce desired effects.
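The abstract does not detail the measurement technique, so the following is only a hedged sketch of one plausible way to quantify a phrase's effect: generate images with and without a modifier under a fixed seed and compare CLIP image embeddings. The model ids and the example modifier are illustrative assumptions, not the authors' protocol.

```python
# Sketch: measure how much a prompt modifier shifts the generated image by
# comparing CLIP image embeddings of outputs with and without the phrase.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image):
    inputs = proc(images=image, return_tensors="pt")
    with torch.no_grad():
        return clip.get_image_features(**inputs)

base = "a portrait of an old sailor"
modified = base + ", in the style of Rembrandt"

gen = torch.Generator("cuda").manual_seed(0)
img_a = pipe(base, generator=gen).images[0]
gen = torch.Generator("cuda").manual_seed(0)
img_b = pipe(modified, generator=gen).images[0]

sim = torch.nn.functional.cosine_similarity(embed(img_a), embed(img_b))
print(f"CLIP similarity with vs. without the modifier: {sim.item():.3f}")
```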
Language Does More Than Describe: On The Lack Of Figurative Speech in Text-To-Image Models
Kleinlein, Ricardo, Luna-Jiménez, Cristina, Fernández-Martínez, Fernando
The impressive capacity shown by recent text-to-image diffusion models to generate high-quality pictures from textual input prompts has fueled the debate about the very definition of art. Nonetheless, these models have been trained using text data collected from content-based labelling protocols that focus on describing the items and actions in an image but neglect any subjective appraisal. Consequently, these automatic systems need rigorous descriptions of the elements and the pictorial style of the image to be generated, and fail to deliver otherwise. As potential indicators of the actual artistic capabilities of current generative models, we characterise the sentimentality, objectiveness and degree of abstraction of publicly available text data used to train current text-to-image diffusion models. Considering the sharp difference observed between their language style and that typically employed in artistic contexts, we suggest generative models should incorporate additional sources of subjective information in their training in order to overcome (or at least alleviate) some of their current limitations, thus effectively unleashing a truly artistic and creative generation.
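The paper's concrete metrics are not given in this abstract; as a rough illustration only, sentiment polarity and subjectivity of caption text can be scored with an off-the-shelf tool such as TextBlob, as sketched below. The example captions are made up and the scoring method is an assumption, not the authors' measure.

```python
# Not the paper's actual metrics -- a simple illustration of scoring caption text
# for polarity (sentimentality) and subjectivity with TextBlob.
from textblob import TextBlob

captions = [
    "a man riding a bicycle down a street",                   # typical descriptive caption
    "the loneliest road I have ever seen, heavy with dusk",   # more "artistic" phrasing
]

for text in captions:
    s = TextBlob(text).sentiment
    print(f"{text!r}: polarity={s.polarity:+.2f}, subjectivity={s.subjectivity:.2f}")
```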
How Does DALL·E-2 Work?
DALL·E-2 is a new AI system that can create realistic images and art from a description in natural language. OpenAI has recently released the beta version of DALL·E-2. In this article, we will take a close look at the original research paper behind DALL·E-2 and understand how exactly it works. DALL·E-2 originates from the paper Hierarchical Text-Conditional Image Generation with CLIP Latents [1] and is based on the unCLIP model proposed there.
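DALL·E-2 itself is not publicly released, so the sketch below uses Karlo, an open re-implementation of the same unCLIP recipe, to show the two-stage structure (a prior followed by a diffusion decoder). The checkpoint id and the UnCLIPPipeline class are assumptions about the open model in diffusers, not something stated in the article.

```python
# Sketch of the unCLIP recipe using the open Karlo checkpoint (DALL-E 2 itself
# is not released). Two stages: a prior maps the CLIP text embedding to a CLIP
# image embedding; a diffusion decoder generates and upsamples the image from it.
import torch
from diffusers import UnCLIPPipeline

pipe = UnCLIPPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16
).to("cuda")

image = pipe("an astronaut riding a horse in a photorealistic style").images[0]
image.save("unclip_sample.png")
```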
OpenAI's new image generator sparks both excitement and fear
OpenAI has unveiled a new AI tool that turns text into images -- and the results are stunning. Named DALL-E 2, the system is the successor to a model unveiled last year. While its predecessor generated some impressive outputs, the new version is a major upgrade. DALL-E 2 adds enhanced textual comprehension, faster image generation, and four times greater resolution. "When approaching DALL-E 2 we focused on improving the image resolution quality and improving latency, rather than building a bigger system," OpenAI researcher Aditya Ramesh told TNW.
DALL·E: Creating Images from Text
DALL·E[1] is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs. We've found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images. GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate images with high fidelity. We extend these findings to show that manipulating visual concepts through language is now within reach.
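As a purely conceptual sketch (not OpenAI's code), the idea of a GPT-style model generating images can be illustrated by a transformer that autoregressively models a sequence of text tokens followed by discrete image tokens, so image generation is just next-token prediction past the caption. The vocabulary sizes, sequence lengths, and model dimensions below are toy values chosen for illustration.

```python
# Toy illustration of DALL-E's autoregressive text+image token modeling.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # illustrative sizes, not the real ones
TEXT_LEN, IMAGE_LEN = 256, 32 * 32      # caption tokens + a 32x32 image token grid

class ToyDalle(nn.Module):
    def __init__(self, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, d_model)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):                        # tokens: (batch, seq)
        seq = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(seq, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))   # next-token logits

# Training pairs a caption's text tokens with the image's discrete-VAE tokens in one
# sequence; at sampling time image tokens are drawn one by one and decoded by the VAE.
model = ToyDalle()
dummy = torch.randint(0, TEXT_VOCAB, (1, TEXT_LEN + IMAGE_LEN))
print(model(dummy).shape)   # (1, 1280, TEXT_VOCAB + IMAGE_VOCAB)
```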